arXiv: 1505.04019v6 [cs.DS] 17 Sep 2015 


Linear-Time Superbubble Identification Algorithm for 

Genome Assembly 


Ljiljana Brankovic a,b , Costas S. Iliopoulos b , Ritu Kundu b , Manal 
Mohamed b , Solon P. Pissis b ’*, Fatima Vayani b 

“School of Electrical Engineering and Computer Science, The University of Newcastle, 

Newcastle NSW 2308, Australia. 
b Department of Informatics, King’s College London, 

London WC2R 2LS, United Kingdom 


Abstract 

DNA sequencing is the process of determining the exact order of the nu¬ 
cleotide bases of an individuars genome in order to catalogue sequence vari¬ 
ation and understand its biological implications. Whole-genome sequencing 
techniques produce masses of data in the form of short sequences known 
as reads. Assembling these reads into a whole genome constitutes a major 
algorithmic challenge. Most assembly algorithms utilise de Bruijn graphs 
constructed from reads for this purpose. A critical step of these algorithms 
is to detect typical motif structures in the graph caused by sequencing errors 
and genome repeats, and filter them out; one such complex subgraph class is a 
so-called superbubble. In this paper, we propose an 0(n + m )-time algorithm 
to detect all superbubbles in a directed acyclic graph with n vertices and m 
(directed) edges, improving the best-known 0(rn log m)-time algorithm by 
Sung et al. 
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1. Introduction 


Since the publication of the first draft of the human genome [l|,i2], the held 
of genomics has changed dramatically. Recent developments in sequencing 
technologies (see [3j, for example) have made it possible to sequence new 
genomes at a fraction of the time and cost required only a few years ago. 
With applications including sequencing the genome of a new species, an 
individual within a population, and RNA molecules from a particular sample, 
sequencing remains at the core of genomics. 

Whole-genome sequencing creates masses of data, in the order of tens of 
gigabytes, in the form of short sequences (reads). Genome assembly involves 
piecing together these reads to form a set of contiguous sequences (contigs) 
representing the DNA sequence in the sample. Traditional assembly algo¬ 
rithms rely on the overlap-layout-consensus approach [4j], representing each 
read as a vertex in an overlap graph and each detected overlap as a directed 
edge between the vertices corresponding to overlapping reads. These meth¬ 
ods have proved their use through numerous de novo genome assemblies [Ej. 

Subsequently, a fundamentally different approach based on de Bruijn 
graphs was proposed jhj, where representation of data elements was organ¬ 
ised around words of k nucleotides, or /c-mers, instead of reads. Unlike in an 
overlap graph, in a de Bruijn graph Q, each k — 1 nucleotide long prefix and 
suffix of the k -mers is represented as a vertex and each k -mer is represented 
as a directed edge between its prefix and suffix vertices. The marginal in¬ 
formation contained by a fc-mer is its last nucleotide. The sequence of those 
final nucleotides is called the sequence of the vertex. In a de Bruijn graph, 
the assembly problem is reduced to finding an Eulerian path, that is, a trail 
that visits each edge in the graph exactly once. 

However, sequencing errors and genome repeats significantly complicate 
the de Bruijn graph by adding false vertices and edges to it. Efficient and 
robust filtering methods have been proposed to simplify the graph by filtering 
out motifs such as tips, bubbles, and cross links, which proved to be caused 
by sequencing errors [8j. In particular, a bubble consists of multiple directed 
unipaths (where a unipath is a path in which all internal vertices are of degree 
2) from a vertex v to a vertex u and is commonly caused by a small number 
of errors in the centre of reads. Although these types of motifs are simple 
and can easily be identified and hltered out, more complicated motifs prove 
to be more challenging. 

Recently, a complex generalisation of a bubble, the so-called superbub- 
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ble, was proposed as an important subgraph class for analysing assembly 
graphs |9(. A superbubble is defined as a minimal subgraph H in the de 
Bruijn graph with exactly one start vertex s and one end vertex t such that: 
(1) id is a directed, acyclic, single-source (s), single-sink (t) graph (2) there 
is no edge from a vertex not in H going to a vertex in H\{s} and (3) there 
is no edge from a vertex in H\{t} going to a vertex not in H. It is clear 
that many superbubbles are formed as a result of sequencing errors, inexact 
repeats, diploid/polyploid genomes, or frequent mutations. Thus, efficient 
detection of superbubbles is essential for the application of genome assem¬ 
bly 

Onodera et al. gave an 0(nm )-time algorithm to detect superbubbles, 
where n is the number of vertices and m is the number of edges in the 
graph [9j. Very recently, Sun g et al. gave an improved 0{m log rn)-tirne al¬ 
gorithm to solve this problem [10j. The algorithm partitions the given graph 
into a set of subgraphs such that the set of superbubbles in all these sub¬ 
graphs is the same as the set of superbubbles in the given graph. This set 
consists of subgraphs corresponding to each non-singleton strongly connected 
component and a subgraph corresponding to the set of all the vertices in¬ 
volved in singleton strongly connected components. Superbubbles are then 
detected in each subgraph; if it is cyclic, it is first converted into a directed 
acyclic subgraph by means of depth-first search and by duplicating some 
vertices. 

Our Contribution. Note that the cost of partitioning the graph and 
transforming it into the directed acyclic subgraphs is linear with respect to 
the size of the graph. However, computing the superbubbles in each directed 
acyclic subgraph requires 0(m\ogm) time [lo|, which dominates the time 
bound of the algorithm. In this paper, we propose a new 0{n + m)-time 
algorithm to compute all superbubbles in a directed acyclic graph. 

This paper is organised as follows: In Section 2 we define superbub¬ 
bles and introduce some of their properties, and in Section 3 we outline the 
0(n + m)-time algorithm for computing superbubbles in a directed acyclic 
graph. In Section 4 we explain a method to validate a candidate superbub¬ 
ble in constant time. The algorithm is analysed in Section 5, while Section 
6 provides some final remarks and directions for future research. 
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2. Properties 

The concept of super bubbles was introduced and formally defined in f9| as 
follows. 

Definition 1 ([9|). Let G = (V, E) be a directed graph. For any ordered 
pair of distinct vertices s and t, (s,t) is called a superbubble if it satisfies 
the following: 

• reachability: t is reachable from s; 

• matching: the set of vertices reachable from s without passing through 
t is equal to the set of vertices from which t is reachable without passing 
through s; 

• acyclicity: the subgraph induced by U is acyclic, where U is the set of 
vertices satisfying the matching criterion; 

• minimality: no vertex in U other than t forms a pair with s that 
satisfies the conditions above; 

vertices s and t, and U\{s,t} used in the above definition are the superbub¬ 
ble ’s entrance, exit and interior, respectively. 

We note that a superbubble (s,t) in the above definition is equivalent 
to a single-source, single-sink, directed acyclic subgraph of G with source s 
and sink t, which does not have any cut vertices and preserves all in-degrees 
and out-degrees of vertices in U\{s,t}, as well as the out-degree of s and 
in-degree of t. 

We next state a few important properties of superbubbles which enable 
the linear-time enumeration of superbubbles. Lemmas |T| and [2] were proved 
by Onodera et al. [9] and Sung et al. fioj ]. respectively. 

Lemma 1 ([IJ). Any vertex can be the entrance (respectively exit) of at most 
one superbubble. 

Note that Lemma [Tj does not exclude the possibility that a vertex is the 
entrance of a superbubble and the exit of another superbubblc. 


4 


Lemma 2 ( flo| ). Let G be a directed acyclic graph. We have the following 
two observations. 

1) Suppose (p, c) is an edge in G, where p has one child and c has one 
parent, then (p, c) is a superbubble in G. 

2) For any superbubble (s,t) in G, there must exist some parent p of t 
such that p has exactly one child t. 

In this paper we start by showing another important property of super- 
bubbles that is closely-related to Lemma |2j 

Lemma 3. For any superbubble ( s,t) in a directed acyclic graph G, there 
must exist some child c of s such that c has exactly one parent s. 

Proof. Assume that all the children of s have more than one parent. Then, 
there must be some cycle or some child c which has a parent that does not 
belong to the superbubble (s,t). This is a contradiction. □ 


3. Finding a Superbubble in a Directed Acyclic Graph 


The main contribution of this paper is an algorithm SuperBubble that 
reports all superbubbles in a directed acyclic graph G = (V, E) with exactly 
one source (the vertex with in-degree 0) and exactly one sink (vertex with 
out-degree 0). If G has more than one source then a new source vertex r' is 
added to V and an edge from r' to each existing source is added to E. The 
same is done if G has more than one sink; in this case, a new sink vertex 
t' is added to V and an edge from each existing sink to t' is added to E. 
If such preprocessing is done, then among the superbubbles reported by the 
algorithm, only those which do not start at r' and do not end at t' represent 
the superbubbles in the original graph. For the sake of simplicity, for the rest 
of this paper and in all the propositions, lemmas and theorems that follow, 
we use G to denote a directed acyclic graph with exactly one source and 
exactly one sink, and we use n and m to denote the number of its vertices 
and edges respectively, that is, for G = (V. E) we have n — \V\ and m = \E\. 

A topological ordering ordD of G maps each vertex to an integer between 
1 and n, such that ordD[x] < ordD[y} holds for all edges (x, y ) G E. There 
exists a classical linear-time algorithm for computing the topological order¬ 


ing of a directed acyclic graph [111 . [12]. In its recursive form, the algorithm 


visits an unvisited vertex of the graph, finds its unvisited neighbour, say v, 
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and performs another topological sort starting from v. The algorithm re¬ 
turns if the current vertex does not have unvisited neighbours. Algorithm 
TopologicalSort, given below, is a simplified version that takes as input 
a single-source, single-sink directed acyclic graph, and produces a topologi¬ 
cal ordering of vertices. For the graph G in Figure CD TopologicalSort 
produces an ordering given in Figure [21 

TopologicalSort(G) 

1 ordern 

2 for each vertex v 6 V do 

3 state[v\ <— unvisited 

4 RecursiveTopologicalSort(G, source) 

RecursiveTopologicalSort(G, v) 

1 state[v\ visited 

2 for each vertex w adjacent to v do 

3 if state [ic] = unvisited then 

4 RecursiveTopologicalSort(G, w) 

5 ordD[v] order 

6 order ■<— order — 1 


Proposition 1. For any topological ordering ordD of vertices in graph G, 
if vertex u is reachable from v, that is, if there is a path from v to u, then 
ordD[v] < ordD[u\. 

PROOF. If the path from v to u is of length 1, i.e., there is an edge (v, u ), then 
by the definition of topological ordering we have ordD[v] < ordD[u\. Other¬ 
wise, we denote the path from v to u of length k, k > 1, as v, Xi ,..., Xk- 1 , u. 
Then by the definition of topological ordering we have ordD[v] < ordD[x i] < 
• • • < ordD[u]. Transitively, we have ordD[u] < ordD[u]. □ 

Importantly, in this paper we do not consider all topological orderings of 
graph G but only those obtained by algorithm TopologicalSort. Note 
that this algorithm finds a directed spanning tree T of G rooted at the 
source, which contains a path from the source to any vertex in G. The 
directed spanning tree T of G obtained by algorithm TopologicalSort 
is presented by bold edges in Figure [2j It may be worth mentioning that a 
directed rooted tree is also known as arborescence. 
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We next present another important property of topological ordering ob¬ 
tained by algorithm TopologicalSort. 

Proposition 2. Let ordD and T be a topological ordering and a directed 
rooted spanning tree of graph G obtained by algorithm TopologicalSort. 
If there is a path in T from a vertex v to a vertex u, then, for each vertex w 
such that ordD[v] < ordD[w] < ordD[u], there is a path from v to w. 

PROOF. Recall that T contains a path from the root to each vertex of the 
tree; this is also true for each subtree of T. Furthermore, if there is a path 
from v to u in T, then u is contained in a subtree of T rooted at v, and each 
w such that ordD[v] < ordD[w} < ordD[u) is also contained in the subtree 
rooted at v (but not in the subtree rooted at u). Therefore, there is a path 
from v to w, for each iv such that ordD[v] < ordD[w] < ordD[u\. □ 

We next show that in an ordering obtained by TopologicalSort, a 
vertex has the topological ordering between the orderings of the entrance 
and the exit of a superbubble if and only if it belongs to the superbubble. 

Lemma 4. Let graph G contain a superbubble ( s , t). Then a topological 
ordering obtained by TopologicalSort has the following properties. 

1. For all x such that x G U\{s,t}, ordD[s\ < ordD[x] < ordD[t\. 

2. For all y such that y fL U, ordD[y\ < ordD[s\ or ordD[y\ > ordD[t}. 

Proof. Recall that U is the set of vertices forming a superbubble (see Def¬ 
inition [T]) . 

1. Since there is a path from the start s of the superbubble to all x G 

U\{s}, by Proposition |T] we have ordD[s] < ordD[x] for all x such that 
x G U\{s}. Similarly, since there is a path from all x G [/\{£} to the 
end t of the superbubble, by Proposition Q] we have ordD[x] < ordD[t] 
for all x such that x G Therefore, for all x such that x G 

U\{s,t}, ordD[s] < ordD[x] < ordD[t\. 

2. Suppose the opposite, that is, suppose that there exists some y (f U 
such that ordD[s\ < ordD[y] < ordD[t]. Since the superbubble (s,t) is 
itself a single-source, single-sink subgraph of G, any directed spanning 
tree of G rooted at the source, will contain a path from s to t. Then 
by Proposition [2] there also exists a path from s to y in T and thus also 


7 



Figure 1: A graph G with set of vertices V = {t>i, U 2 > ■' * ,^ 15 }- Note that G has as a 
single source v\ and as a single sink vu. 



Figure 2: Vertices of Figure Q] in topological order, where ordD[v 1 ] = 1, ordD[v 2 ] = 2, 
ordD[v 3 ] = 3, ordD[v, 1 ] = 11, ordD[v 5 ] = 6 , ordD[ve] = 8 , = 10, = 12, 

ordD[vg] = 7, ordD[v 10 ] = 9, ord£>[un] = 4, ordD[vi 2 ] = 5, ordD[v 13 ] = 13, ordD[i>i 4 ] = 15 
and ordD[v 15 ] = 14 


in G. However, by the definition of the superbubble, the only vertices 
reachable from s without going through t are the internal vertices of the 
superbubble — a contradiction. Therefore, for all y such that y U , 
either ordD[y] < ordD[s ] or ordD[y\ > ordD[t]. □ 

Algorithm SuperBubble starts by topologically ordering the vertices of 
graph G and then checks each vertex in V. in topological order, to identify 
whether it is an exit or an entrance candidate (or both). According to Lem¬ 
mas [2] and |3l a vertex v is an exit candidate if it has at least one parent with 
exactly one child (out-degree 1) and an entrance candidate if it has at least 
one child with exactly one parent (in-degree 1). There are at most 2 n candi¬ 
dates, thus the cost of constructing a doubly-linked list of all the candidates 
is linear in n. The elements of the candidates list are ordered according to 
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Figure 3: Candidates list for Figure[T| candidates = {wi(entrance), 113 (exit), ^(entrance), 
un (entrance), UiRexit), V 5 (entrance), uio(exit), 17 (exit), us(exit), vs (entrance), 
U 13 (entrance), UiRexit)}. Note that both V 3 and vs appear twice in the list. 


ordD , and each candidate is labelled as an exit or an entrance candidate. 
Note that if a vertex v is both an exit and an entrance candidate, then v 
appears twice in the candidates list, first as an exit and then as an entrance 
(Figure EJ). 

Algorithm SuperBubble processes the candidates list of graph G in 
decreasing topological order (backwards). Let v[,v' 2 ,... ,v' £ be the list of 
candidates. The algorithm performs the following: 

• If v'j is an entrance candidate, then delete w'; 

• If v'j is an exit candidate, then subroutine ReportSuperBubble is 
called to find and report the superbubble ending at u', that is, the 
superbubble (u',u'), for some entrance candidate v\. Subroutine Re¬ 
portSuperBubble also recursively finds and reports all nested su¬ 
perbubbles between v\ and v' rj . 

For clarity of presentation, we next provide a list and a short description 
of subroutines and arrays used by algorithm SuperBubble and subroutine 
ReportSuperBubble. Before that, it is worth mentioning that candidates 
is a doubly-linked list of entrance and exit candidates; specifically, an element 
of the list is a vertex along with a label specifying if it is an entrance or an 
exit candidate. For the sake of simplicity of the following routines, we use 
a vertex and its corresponding candidate (element in the candidates list) 
interchangeably. This does not add to the complexity of the algorithm as we 
can use an auxiliary array v, where v[i] stores a pointer to the corresponding 
element Vi in candidates so as to provide a constant-time conversion from a 
vertex to the corresponding candidate. 

1. Entrance^) takes as input a vertex v and outputs TRUE if v is an 
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entrance candidate, that is, if it satisfies Lemma El and FALSE other¬ 
wise. 

2. Exit(v) takes as input a vertex v and outputs TRUE if v is an exit 
candidate, that is, if it satisfies Lemma [2] and FALSE otherwise. 

3. InsertEntrance('u) takes as input a vertex v, inserts it at the end 
of candidates and labels it as entrance. 

4. InsertExit(v) takes as input a vertex v, inserts it at the end of can¬ 
didates and labels it as exit. 

5. Head (candidates) and Tail {candidates) return the first and the last 
element in candidates, respectively. 

6. DeleteTail( candidates) deletes the last element in candidates. 

7. Next(c) returns the candidate following v in candidates. 

In addition to the above subroutines, the main algorithm also explicitly 
makes use of the following arrays. 

1. The array ordD stores the topological order of the vertices. 

2. The array previous Entrance stores the previous entrance candidate s for 
each vertex v. Formally, previousEntrance[v] = s where s is an entrance 
candidate such that ordD[s] < ordD[v\, and there does not exist another 
entrance candidate s' such that ordD[s ] < ordD[s'} < ordD[v\. 

3. The array alternativeEntrance is used to reduce the number of entrance- 
exit pairs that need to be considered as possible superbubbles. Array 
alternativeEntrance is further detailed in Section 14.11 

Note that subroutine ReportSuperBubble is called for each exit can¬ 
didate in decreasing order either by algorithm SuperBubble or through a 
recursive call to identify a nested superbubble. A call to subroutine REPORT- 
SuPERBuBBLE(stert, exit) checks the possible entrance candidates between 
start and exit, starting with the nearest previous entrance candidate (to exit). 
This task is accomplished with the help of subroutine VALIDATESUPER- 
Bubble, explained in the following section, which checks whether a given 
candidate ( s,t ) is a superbubble or not; if it is not, the algorithm returns 
either a “-1” which means that no superbubble ends at t, or an alternative 
entrance candidate for a superbubble that could end at t. For the graph 
in Figured! the algorithm detects and reports five superbubbles: (u 8 ,Ui 4 ), 
{v 3 ,v 8 ), (v 5 ,v 7 ), (v u ,v 12 ) and {v 1} v 3 ). Here, both (v 5 ,v 7 ) and (v u ,v 12 ) are 
nested superbubbles. 
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SuperBubble(G) 

1 TopologicalSort(G) 

2 prevEnt <— NULL 

3 for each vertex v in topological order do 

4 alternativeEntrance[v\ <— NULL 

5 previousEntrance[v] <— prevEnt 

6 if Exit(v) then 

7 InsertExit(d) 

8 if Entrance(u) then 

9 InsertEntrance(u) 

10 prevEnt •(— v 

11 while candidates is not empty do 

12 if ENTRANCE(TAlL(candidates)) then 

13 DeleteTail( candidates) 

14 else REPORTSuPERBuBBLE(HEAD(candidates),TAlL(canoMates)) 


ReportSuperBubble (start, exit ) 

1 > Report the superbubble ending at exit (if any) 

2 if (start = NULL) | (exit — NULL) | ( ordD[start} > ordD[exit ]) then 

3 DeleteTail( candidates) 

4 return 

5 s {— previousEntrance[exit] 

6 while (ordD[s\ > ordD[start\ ) do 

7 valid <- ValidateSuperBubble(s, exit) 

8 if (valid = s ) | (valid = altemativeEntrance[s\) \ (valid = —1) then 

9 break 

10 alternativeEntrance[s\ <— valid 

11 s i — valid 

12 DeleteTail( candidates) 

13 if (valid = s) then 

14 Report((s, exit)) 

15 while (Tail (candidates) is not s) do 

16 if ExiT(TAlh(candidates)) then 

17 > Check for nested superbubbles 

18 ReportSuperBubble(Next(s), Tail (candidates)) 

19 else DeleteTail (candidates) 

20 return 
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Remark 1. It is also possible to design the algorithm so as to move forward 
in topological order instead of backwards. 

For graph G in Figure [H algorithm SuperBubble(G') makes exactly three 
calls to subroutine ReportSuperBubble: 

1. ReportSuperBubble^!, vu)\ First, it checks the exit candidate 
against the nearest previous entrance candidate, i.e. vertex ui 3 . Sub¬ 
routine ValidateSuperBubble(ui 3 , ui 4 ) returns v% as an alternative 
entrance candidate. The new candidate is then checked and the super¬ 
bubble (vg, U 14 ) is reported. 

2. ReportSuperBubble(ui, v 8 ): First, it checks the exit candidate Vg 
against the nearest previous entrance candidate, i.e. vertex v$. Sub¬ 
routine ValidateSuperBubble(u 5 , v 8 ) returns n 3 as an alternative 
entrance candidate. The new candidate is then checked and the super¬ 
bubble (n 3 ,Ug) is reported. Additionally, two recursive calls are made: 

(a) ReportSuperBubble(uii, v 7 ): First, it validates {v5,v 7 ) and re¬ 
ports it. Then, it makes a recursive call to subroutine Report- 
SuperBubble(uio, Uio) which terminates without reporting any 
superbubble. 

(b) ReportSuperBubble(uii, v 12 )- validates (^11,^12) and reports 
it. 

3. ReportSuperBubble^!, u 3 ): validates (ni,n 3 ) and reports it. 


4. Validating a Superbubble 


In this section, we describe subroutine ValidateSuperBubble. The 
ability to validate a candidate superbubble depends on the following result 
related to the Range Minimum Query problem. 

The Range Minimum Query problem, RMQ for short, is to preprocess 
a given array A[ 1.. n] for subsequent queries of the form: “Given indices 
i,j , what is the minimum value of A[i. .j]T\ The problem has been studied 
intensively for decades and several (0(n), 0(1))-RMQ data structures have 
been proposed, many of which depend on the equivalence between the Range 
Minimum Query and the Lowest Common Ancestor problems 
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In order to check whether a superbubble candidate (s, t ) is a superbubble 
or not, we propose to utilise the range min/max query problem as follows: 
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1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 


Vl v 2 V 3 Vu Vi2 v 5 Vg V 6 V 10 V 7 V A V 8 V 13 V 15 V U 

OutParent[j} -11 3 4366 7835 12 13 12 

OutChild{j} 3 3 11 5 12 8 9 10 10 12 12 15 15 15 - 


Figure 4: OutParent and OutChild arrays for the graph in Figure [T] 

• For a given graph G = (V, E ) and for each vertex v G V with topological 
order ordD[v ], calculate the topological orderings of the parent and the 
child of v that are topologically furthest from v. 


OutParent[ordD[v]\ = min ({ ordD[ui\ \ ( Ui,v ) G E}), 
OutChild[ordD[v ]] = max ({ ordD[ui\ \ (v,Ui) G E}). 

• For an integer array A and indices i and j we define RangeMin(7L, i, j) 
and RangeMax(t4, i, j) to return the minimum and maximum value 
of A[i..j\, respectively. 

Then for a given superbubble candidate (s, t), where s and t are an en¬ 
trance and an exit candidate respectively (satisfying Lemmas [U and [21) , 
if (s,t) is a superbubble then the following two conditions are valid 


RANGEMlN(0«tParenf,ordR[s]+l,or(7D[t]) = ordD[s\, 
PANGEM.Ax(OutChild : ordD[s],ordD[t]-l) = ordD[t\. 

For example, Figure |4] represents both OutParent and OutChild arrays 
computed for graph G in Figure [D Furthermore, a candidate (v 3 , v 8 ) is not 
a superbubble as RangeMin (OutParent, ordD[v 5 ] + 1 , ordD[v s \) = 3^6 = 
ordD[v 5 ]. 

ft should be clear that after an 0(n + m)-time preprocessing, validat¬ 
ing a superbubble requires 0 ( 1 ) time which is the cost for range max/min 
query. Subroutine VALlDATESuPERBuBBLE(starf Vertex, endVertex) is de¬ 
signed to return an appropriate entrance candidate for a superbubble ending 
at endVertex (if any), as follows. 
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VALlDATESuPERBuBBLE(star£ Vertex, endVertex ) 

1 startordD[startVertex] 

2 end ordD[endVertex] 

3 outchild RangeMax(0u*C7mM, start, end — 1) 

4 outparent RangeMin( OutParent, start + 1, end) 

5 if outchild ^ end then 

6 return —1 

7 if outparent = start then 

8 return startVertex 

9 elseif Entrance (Vertex (emtparenf)) then 

10 return V ertex ( outparent) 

11 else return previous Entrance\VERTEx( outparent)] 

Note that subroutine ValidateSuperBubble utilises subroutine En¬ 
trance and the array previous Entrance defined in Section |3l as well as 
subroutine Vertex that takes as input an integer i and outputs vertex v 
such that ordD[i] = v. 

An important observation is that a subsequent call to subroutine VALI- 
DATESuperBubble, for a given entrance candidate, returns alternative en¬ 
trance candidates in strictly non-decreasing topological order as proved by 
Lemma [5] 

Lemma 5. Let t be the alternative entrance candidate returned by subroutine 
ValidateSuperBubble(s, e) . Then for any exit candidate e' such that 
ordD[s\ < ordD[e'] < ordD[e], the order of the alternative entrance candidate 
t' returned by subroutine ValidateSuperBubble(s, e!) will be greater than 
or equal to the order oft. 

PROOF. Recall that the alternative entrance t returned by the subroutine 
ValidateSuperBubble(s, e') is either a vertex with topological order out- 
perent, or the previous Entrance of this vertex. 

Since outparent = RangeMin( OutParent, ordD[s] + l, ordD[e \), outparent J = 
RangeMin( OutParent, ordD[s\+ 1, ordD[e']) and ordD[s\ < ordD[e'} < ordD[e ], 
we have outparent < outparent J . Therefore, ordD{t) < ordD{t'). □ 

4-1. Validation and alternativeEntrance 

In case the validation of the candidate pair (to,e) fails, subroutine Val- 
lDATESuPERBuBBLE(fo, e) returns either “-1” or an alternative candidate 


14 


1 1 which might be an entrance of a superbubble ending at e. This alterna¬ 
tive candidate t\ is either a vertex tq, if u\ is an entrance candidate, or the 
previous entrance candidate of u± such that 


ordD[u\} = OutParent[ordD[vi]] 

= RANGEMlN(OutParent,ordD[t 0 ] + 1 ,ordD[e]), 

where V\ is some vertex between f 0 and e in the topological ordering. 

Suppose ti is also not a valid entrance of the superbubble ending at e. 
Then there must be a vertex V 2 , ordD[t\] < ordD[v 2 \ < ordD[to], with some 
parent u 2 , such that ordD[u 2 } = OutParent[ordD[v 2 ]]. Then the alternative 
entrance is some t 2 , which is either a vertex u 2 or its previous entrance 
and thus ordD[t 2 } < ordD[ti\. A series of such failed validations produces a 
sequence ti,t 2 , ... of failed alternative entrance candidates. 

An important observation here is that any entrance ti, for i > 1, from such 
a sequence is an invalid entrance not only for the superbubble ending at e 
but also for all those ending at any other exit vertex e! such that ordD[ti_i] < 
ordD[e'} < ordD[e] and ti = VALlDATESuPERBuBBLE(tj_ 1 , e'). This is the 
case because the vertex Vi which causes the alternative entrance ti to fail is 
such that ordD[ti] < ordD[vi] < ordD[ti_ i] for i > 1. Therefore, Vi does not 
depend on the exit e but rather on the previous failed candidate entrance. 

This is where array alternativeEntrance plays an important role. Storing 
alternativeEntrance[ti_i) = U for i > 1 enables us to skip this sequence at a 
later stage if ti is returned by subroutine ValidateSuperBubble(A_i, e!\ 

5. Algorithm Analysis 

In this section, we analyse the correctness and the running time of the 
proposed algorithm SuperBubble. For simplicity, in the following lemma 
we will slightly abuse the terminology and refer to ( s,t) as a superbubble if 
it satisfies the first three conditions given in Definition [U and as minimal 
superbubble if it also satisfies the last condition in the same definition. 

Lemma 6. For a given exit candidate e, let s be the entrance candidate such 
that superbubble ( s , e) is reported by subroutine ValidateSuperBubble(s, e). 
Then ( s , e) is a minimal superbubble. 
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Proof. By contradiction, let e! be an exit candidate such that (s, e') is also 
a superbubble and ordD[s] < ordD[e'] < ordD[e\. Then, either ordD[e] = 
ordD[e'} + 1 or there is at least one vertex v such that ordD[e '] < ordD[v] < 
ordD[e\. 

In the first case, ordD[e\ = ordD[e'] +1 implies that e is the only child of e! 
and e 7 is the only parent of e, which, by Lemma [5] makes (e 7 , e ) a superbubble. 

In the second case, where there is at least one vertex v such that ordD[e 7 ] < 
ordD[v] < ordD[e], we also argue that (e 7 , e) must be a superbubble. Indeed, 
(e 7 , e) satisfies the following conditions: 

1. Reachability: Since (s, e) is a superbubble, e is reachable from s; 
since (s, e 7 ) is also assumed to be a superbubble, any path from s to e 
must go through e 7 , therefore e is reachable from e’. 

2. Matching: The only vertices reachable from e 7 without going through 
e are those whose topological order is between ordD(e') and ordD(e). 
Indeed, since (s, e) and ( s , e!) are superbubbles, all these vertices are 
reachable from s through e 7 , and no vertices with topological order 
greater than ordD(e) are reachable from e' without going through e. 
Similarly, there are no edges between vertices with topological order less 
than ordD(e') and those with the topological order between ordD(e') 
and ordD(e). Therefore, the only vertices from which e is reachable 
without going through e' are those whose topological order is between 
ordD(e') and ordD(e). 

3. Acyclicity: Since (s, e) is a superbubble it is acyclic; since (e', e) is a 
subgraph of (s,e), it is also acyclic. 

In both cases, since for each exit candidate the entrance candidates are 
checked in reverse topological order, subroutine ValidateSuperBubble 
would have been called on (e 7 , e) first, and would have reported (e 7 , e) instead 
of (s,e). Therefore, (s,e) is a minimal superbubble. □ 

Lemma 7. For the given entrance and exit candidates s and e, respectively, 
subroutine ValidateSuperBubble reports {s,t ) , if and only if, ( s,t) is a 
superbubble. 

Proof. We prove the lemma by showing that if (s,t) is a superbubble 
then subroutine ValidateSuperBubble reports it, and if VALIDATESu- 
PERBUBBLE reports (s,t) then (s,£) is a superbubble. 
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1. We start by showing that if ( s, t) is a superbubble then subroutine Val- 
idateSuperBubble reports it. Indeed, by Lemma 01 all the vertices 
with topological orderings between s and t belong to the superbub¬ 
ble (s,t). Therefore, the minimum OutParent is s and the maximum 
OutChild is t and thus subroutine ValidateSuperBubble reports 
<M>. 

2. We next show that if subroutine ValidateSuperBubble reports (s, t) 
then (s, t) is a superbubble. Let start and end be two integers, such 
that ordD[s] = start and ordD[t] = end. The graph G as defined, has a 
single source r and a single sink r'; this implies that any vertex v 6 V is 
reachable from r and, at the same time, can reach r'. This is also true 
for s, t and for any vertex v such that ordD[s\ < ordD[v] < ordD[t ]. 
First, we show that t is reachable from s. Recall that t is an exit 
candidate, so, it has a parent p with out-degree 1. Assume that t is not 
reachable from s, then there must be a path from r ^ t which does 
not involve s. This implies that either OutParent[end\ < start, or there 
exists a vertex v such that start < ordD[v] < end, OutParent[v] < start 
and there exists a path r v t, which is a contradiction. 

Similarly, we can show that every vertex v such that start < ordD[v] < 
end satisfies the matching criterion of the superbubble. 

The acyclicity criterion is guaranteed by the acyclicity of G and the 
minimality is satisfied by the design of subroutine ReportSuper- 
Bubble which assigns each exit of a superbubble to the nearest en¬ 
trance, and by the correctness of Lemma [6l □ 

Lemma 8. For a given exit candidate e, let t be the alternative entrance 
candidate returned by subroutine ValidateSuperBubble(s, e). Then any 
entrance candidate between t and e cannot be a valid entrance for the super¬ 
bubble ending at e. 

Proof. By contradiction, assume that s' is an entrance candidate between 
t and e such that (s', e) is a superbubble. If s' had been between s and e, it 
would have already been reported, as SuperBubble checks entrance candi¬ 
dates in reverse topological order starting from e. Therefore, s' is between 
t and s, such that ordD[t] < ordD[s'] < ordD[s) < ordD[e]. Let outparent = 
RangeMin (OutParent, ordD[s\ + 1, ordD[e\). Then, vertex at outparent is 
between t and s', otherwise subroutine ValidateSuperBubble(s, e) would 
have returned s' (instead of t). Therefore, ordD[t} < outparent < ordD[s'}. 


17 


Let outpareni! = RangeMin( OutParent, ordD[s'} + 1, ordD[e\). Then 
outpareni! < outparent. This implies that outpareni! < outparent < ordD[s']. 
However, for (s', e) to be a valid superbubble, outpareni! should have been 
equal to ordD[s']. Hence, the assumption is wrong and thus, it is proved that 
there cannot be an entrance candidate, between t and e, which is a valid 
entrance for the superbubble ending at e. □ 

Lemma 9. For the given entrance and exit candidates s and e±, respectively, 
let alternativeEntrance[s\ be set to ti which later gets reset to t 2 such that 
t 2 7 ^ ti, while considering s with another exit candidate e 2 . Then no entrance 
candidate between s and e 2 can reset alternativeEntrance[s\ to R again. 

Proof. Let e 3 be an exit candidate between s and e 2 such that subroutine 
ValidateSuperBubble(s, e 3 ) returns 1 3 . Then by Lemma [5j ordD[ti] < 
ordD[t 2 ] < ordD[t 3 \. Since t\^ t 2 , we have ordD[ti] < ordD[t 2 ] < ordD[t 3 ]. 
Therefore, ti < t 3 and alternativeEntrance[s\ cannot be reset to the same 
value t\ again. □ 

Theorem 1. Algorithm SuperBubble reports all superbubbles, and only 
superbubbles, in graph G in decreasing topological order of their exit vertices 
in 0(n + m)-time. 

Proof. Consider an execution of algorithm SuperBubble. Let super¬ 
bubbles (si, ti), - ■ • , (sfc, t k ) be the successive superbubbles reported just af¬ 
ter the execution of Line dH of subroutine ReportSuperBubble, where 
ordD(ti ) > ordD(t 2 ) > • • • > ordD(t k ). 

1 . First, we show that each (s t , t 3 ) reported by the algorithm in Line ITT 
is a superbubble. This is proved by the correctness of Lemma CO 

2 . Second, no superbubblc is missed out by the algorithm as proved by 
the following. Subroutine ReportSuperBubble is called for each exit 
candidate in decreasing order. The entrance candidate for the super¬ 
bubble (if any) ending at exit will only be between start and exit, where 
start is either the head of the the candidates list (when subroutine Re- 
PORtSuperBubble is called from algorithm SuperBubble) or next 
candidate of the entrance of an outer superbubble (when called through 
a recursive call to identify a nested superbubble). A call to subroutine 
ReportSuperBubble (start, exit) checks the possible entrance candi¬ 
dates between start and exit, starting with the nearest previous entrance 
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candidate (to exit). Subroutine ValidateSuperBubble either suc¬ 
cessfully validates an entrance candidate, or returns a “-1”, or returns 
an alternative entrance candidate. From Lemma [8] there cannot be 
any valid entrance between this alternative entrance and exit. If this 
alternative entrance starts a sequence of entrances already checked for 
some exit candidate previously (as depicted by alternativeEntrance ), 
then all entrances of that sequence will be skipped, otherwise this al¬ 
ternative entrance will be tested. However, as mentioned in Section I47T1 
none of the entrance candidates in the skipped sequence can be valid. 
Therefore, for each exit candidate, every potential entrance candidate 
is checked for validity, and those which are not considered are not valid. 

3. Third, the running time of SuperBubble is 0(n + m). Indeed, the 
running time of the TOPOLOGICALSORT and computing the candidates 
list is 0(n + m). Furthermore, all list operations cost constant time 
each, and sum up to a linear cost of O(n), as there are at most 2 n can¬ 
didates in the list. Finally, each call for subroutine VALIDATESUPER- 
Bubble costs 0(1). The total number of times ValidateSuperBub¬ 
ble is called is 0(n+m). This is because subroutine ValidateSuper¬ 
Bubble is called once for each exit candidate in Line 7 of subroutine 
ReportSuperBubble, and the total number of such calls is bounded 
by O(n). Additionally, it is called every time a new alternativeEn¬ 
trance sequence is generated by subroutine ValidateSuperBubble. 
It follows from Lemma [U] that once an alternativeEntrance sequence 
is reset, it cannot be generated again by subsequent calls to subrou¬ 
tine ValidateSuperBubble. This resetting of alternativeEntrance 
for each entrance candidate (Line [TO]) thus enables avoiding repeated 
checks of the same sequences of entrance candidates. Resetting is done 
every time an edge is considered for the first time between a vertex 
(in between an entrance candidate startVertex and an exit candidate 
endVertex) and its topologically furthest parent (whose order is less 
than that of startVertex ). Thus, the total number of times alterna¬ 
tiveEntrance will be reset (for all the entrance candidates) is bounded 
by 0(m). 

Therefore, the total running time for reporting all superbubbles in 
graph G is 0{n + m). 


□ 
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6. Final Remarks 


We presented an 0(n + m)-time algorithm to compute all superbubbles 
in a directed acyclic graph, where n is the number of vertices and m is the 
number of edges, improving the best-known 0(m log m)-time algorithm for 
this problem lo|| . It is also interesting to note that in this type of graph, that 


is, constructed from sequences over a fixed-sized alphabet, the out-degree of 
each vertex is bounded by the size of the alphabet (four for DNA alphabet); 
therefore, the time complexity of the proposed algorithm is essentially linear 


m n. 


Our immediate goal is to practically evaluate our algorithm and com¬ 
pare its implementation to an earlier result [ 9 ]. It would also be interesting 
to investigate other superbubble-like structures in assembly graphs, such as 
complex bulges 


References 

[1] E. S. Lander, L. M. Linton, B. Birren, C. Nusbaum, M. C. Zody, J. Bald¬ 
win, K. Devon, K. Dewar, M. Doyle, W. FitzHugh, et ah, Initial se¬ 
quencing and analysis of the human genome, Nature 409 (6822) (2001) 
860-921. 

[2] J. C. Venter, M. D. Adams, E. W. Myers, P. W. Li, R. J. Mural, G. G. 
Sutton, H. O. Smith, M. Yandcll, C. A. Evans, R. A. Holt, et ah, The 
sequence of the human genome, Science 291 (5507) (2001) 1304-1351. 

[3] S. Balasubramanian, D. Klcnerman, C. Barnes, M. Osborne, Patent 
US20077232656 (2007). 

[4] S. Batzoglou, Algorithmic challenges in mammalian genome sequence as¬ 
sembly, Encyclopedia of genomics, proteomics and bioinformatics, John 
Wiley and Sons, Hoboken (New Jersey). 

[5] J. Butler, I. MacCallum, M. Kleber, I. A. Shlyakhter, M. K. Belmonte, 
E. S. Lander, C. Nusbaum, D. B. Jaffe, ALLPATHS: de novo assembly 
of whole-genome shotgun microreads, Genome Research 18 (5) (2008) 
810-820. 

[6] P. A. Pevzner, H. Tang, M. S. Waterman, An Eulerian path approach 
to DNA fragment assembly, Proceedings of the National Academy of 
Sciences of the U. S. A. 98 (17) (2001) 9748-9753. 


20 



[7] N. G. de Bruijn, A combinatorial problem, Koninklijke Nederlandse 
Akademie v. Wetenschappen 49 (1946) 758-764. 

[8] D. R. Zerbino, E. Birney, Velvet: algorithms for de novo short read 
assembly using de Bruijn graphs, Genome Research 18 (5) (2008) 821- 
829. 

[9] T. Onodera, K. Sadakane, T. Shibuya, Detecting superbubbles in as¬ 
sembly graphs, in: WABI, 2013, pp. 338-348. 

[10] W. Sung, K. Sadakane, T. Shibuya, A. Belorkar, I. Pyrogova, An 0(m 
log m)-time algorithm for detecting superbubbles, IEEE/ACM Trans. 
Comput. Biology Bioinform. 12 (4) (2015) 770-777. 

[11] R. L. R. Thomas H. Cornren, Charles E. Leiserson, C. Stein, Introduc¬ 
tion to Algorithms, MIT Press, Cambridge, MA., 2001. 

[12] R. Tarjan, Edge-disjoint spanning trees and depth-first search, Acta 
Informatica 6 (2) (1976) 171-185. 

[13] D. Harel, R. Tarjan, Fast algorithms for finding nearest common ances¬ 
tors, SIAM Journal on Computing 13(2) (1984) 338-355. 

[14] J. Fischer, V. Heun, Theoretical and practical improvements on the 
RMQ-problem, with applications to LCA and LCE, in: M. Lewenstein, 
G. Valiente (Eds.), Combinatorial Pattern Matching, 17th Annual Sym¬ 
posium, CPM 2006, Barcelona, Spain, July 5-7, 2006, Proceedings, Vol. 
4009 of Lecture Notes in Computer Science, Springer, 2006, pp. 36-48. 

[15] S. Durocher, A simple linear-space data structure for constant-time 
range minimum query, in: A. Brodnik, A. Lpez-Ortiz, V. Raman, A. Vi¬ 
ola (Eds.), Space-Efficient Data Structures, Streams, and Algorithms, 
Vol. 8066 of Lecture Notes in Computer Science, Springer Berlin Hei¬ 
delberg, 2013, pp. 48-60. 

[16] S. Nurk, A. Bankevich, D. Antipov, A. A. Gurevich, A. Korobeynikov, 
A. Lapidus, A. D. Prjibelski, A. Pyshkin, A. Sirotkin, Y. Sirotkin, 
R. Stepanauskas, S. R. Clingenpeel, T. Woyke, J. S. McLean, R. Lasken, 
G. Teslcr, M. A. Alekseyev, P. A. Pevzner, Assembling single-cell 
genomes and mini-metagenomes from chimeric MDA products, Journal 
of Computational Biology 20 (10) (2013) 714-737. 


21 



